Rats of New York¶

Project Goals¶

Link to our GitHub

The team, Miranda Diaz and Raiya Dhawala will likely be investigating data sets related to rat sightings in New York City and how they have been affected by COVID-19, and the frequency by bourough.

Project Dataset¶

The data set we are working with is Rat Sightings. This data set contains information on rat sightings in New York City from 2010 to present day based on service requests made by the community. The data is from 311 service requests. Each row is different rat sighting with a unique key, which describes the unique identifier of the Service Request and in theory a different rat, although there is no way to be sure. Some columns describe the address, the date, the streets, the landmark and the borough of where each rat was seen. There is also a column which shows when the service request was created. We used this to divide rat sightings by time, day, month and year. We were interested in using this data set because it provides data that can be relevant when answering different questions, such as, “How have rat sightings increased or decreased in different boroughs since COVID-19”. We brought in further data to assess COVID-19 numbers in different New York City neighborhoods. We also brought in NYC population data to analyze rat sightings per 1000 people in different boroughs. Further data may look at the population of different neighborhoods, and whether certain neighborhoods make more 311 requests than others.

Collaboration Plan¶

Our collaboration plan is to meet on Thursdays around 4:30 to work on the final project. We will be communicating through text messages to keep each other updated and organized. We will be using google docs to draft things and then transfer it into google colab.

ETL (Extraction, Transform, and Load)¶

We loaded our dataset into the dataset folder. This dataset originally has 38 columns showing things such as the date the request was created, the agency name, the location (including neighborhood and address), and the status of the complaint. We tidied the data by getting rid of the repetitive location columns such as cross street 1 and 2, and only keeping the incident address and neighborhood. We then created a month, year column and set the date as the index.

In [20]:
#import pandas and matplotlib

import pandas as pd
import matplotlib.pyplot as plt

#mount drive
from google.colab import drive

drive.mount('/content/drive')

#read in rat csv
rat_df = pd.read_csv('/content/drive/My Drive/Rat_Sightings.csv')

#display it
rat_df.head()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
<ipython-input-20-de1a53ac8c63>:12: DtypeWarning:

Columns (20) have mixed types. Specify dtype option on import or set low_memory=False.

Out[20]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 15636547 01/02/2010 09:38:29 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN ... NaN NaN NaN NaN NaN NaN NaN 40.703342 -73.800203 (40.70334174980328, -73.80020318978804)
1 15636689 01/02/2010 07:09:56 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN ... NaN NaN NaN NaN NaN NaN NaN 40.608288 -73.977775 (40.60828832742201, -73.97777515791093)
2 15636710 01/02/2010 09:04:46 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD ... NaN NaN NaN NaN NaN NaN NaN 40.726060 -73.848329 (40.726059538480236, -73.84832900182792)
3 15636731 01/02/2010 06:27:59 PM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET ... NaN NaN NaN NaN NaN NaN NaN 40.677957 -73.918921 (40.67795748580213, -73.91892129635086)
4 15636907 01/02/2010 12:50:16 PM NaN DOHMH Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.866008 -73.919401 (40.86600810878767, -73.9194006525354)

5 rows × 38 columns

We made the formatted date the index of the rat dataframe so that it was easier to manipulate with other datasets such as the COVID data.

In [21]:
#get rid of unnecesary columns
rat_df.drop(columns = ['Closed Date','Agency', 'Cross Street 1', 'Cross Street 2', 'Intersection Street 1', 'Intersection Street 2', 'Address Type', 'Landmark', 'Status','Due Date','Resolution Action Updated Date', 'Street Name', 'City', 'Facility Type', 'Due Date', 'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment', 'Location' ], inplace=True)

#use to_datetime function to change Created Date to Datetime format so we can extract month and year
rat_df['Created Date'] = pd.to_datetime(rat_df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')

#Extract month and year into new columns
rat_df['Month'] = rat_df['Created Date'].dt.month
rat_df['Year'] = rat_df['Created Date'].dt.year
rat_df['Day'] = rat_df['Created Date'].dt.day

rat_df['formatted_date'] = pd.to_datetime(rat_df[['Month', 'Day', 'Year']]).dt.strftime('%m/%d/%Y')
rat_df['formatted_date'] = pd.to_datetime(rat_df['formatted_date'], format='%m/%d/%Y')

#set Created Date as index
rat_df.set_index("formatted_date", inplace=True)
rat_df.head()
Out[21]:
Unique Key Created Date Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Community Board Borough Park Facility Name Park Borough Latitude Longitude Month Year Day
formatted_date
2010-01-02 15636547 2010-01-02 09:38:29 Department of Health and Mental Hygiene Rodent Rat Sighting Other (Explain Below) 11432.0 NaN 12 QUEENS QUEENS Unspecified QUEENS 40.703342 -73.800203 1 2010 2
2010-01-02 15636689 2010-01-02 07:09:56 Department of Health and Mental Hygiene Rodent Rat Sighting Catch Basin/Sewer 11204.0 NaN 11 BROOKLYN BROOKLYN Unspecified BROOKLYN 40.608288 -73.977775 1 2010 2
2010-01-02 15636710 2010-01-02 09:04:46 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11375.0 68-12 YELLOWSTON BOULEVARD 06 QUEENS QUEENS Unspecified QUEENS 40.726060 -73.848329 1 2010 2
2010-01-02 15636731 2010-01-02 18:27:59 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Mixed Use Building 11233.0 1040 HERKIMER STREET 03 BROOKLYN BROOKLYN Unspecified BROOKLYN 40.677957 -73.918921 1 2010 2
2010-01-02 15636907 2010-01-02 12:50:16 Department of Health and Mental Hygiene Rodent Rat Sighting 3+ Family Apt. Building 10034.0 241 SHERMAN AVENUE 12 MANHATTAN MANHATTAN Unspecified MANHATTAN 40.866008 -73.919401 1 2010 2
In [22]:
#display datatypes
display(rat_df.dtypes)
Unique Key                     int64
Created Date          datetime64[ns]
Agency Name                   object
Complaint Type                object
Descriptor                    object
Location Type                 object
Incident Zip                 float64
Incident Address              object
Community Board               object
Borough                       object
Park Facility Name            object
Park Borough                  object
Latitude                     float64
Longitude                    float64
Month                          int64
Year                           int64
Day                            int64
dtype: object

We created a data table with rat sighting count per day so as to compare to the COVID daily cases with daily rat sightings

In [23]:
#Make dataframe which has amount of rats spotted per day
rat_per_day_df = rat_df.groupby('formatted_date').size().reset_index(name='rat_sightings_count')
rat_per_day_df.head()
Out[23]:
formatted_date rat_sightings_count
0 2010-01-01 9
1 2010-01-02 12
2 2010-01-03 3
3 2010-01-04 24
4 2010-01-05 14

Exploring Rat Populations Across Boroughs¶

Here we see the total amount of rats reproted in each borough, with Brooklyn having the highest count of rats. But in oder to get a better perspective, why dont we look at this on a map?

In [24]:
#show frequency of rat sightings by borough
rat_df['Borough'].value_counts()
Out[24]:
BROOKLYN         84208
MANHATTAN        61172
BRONX            42182
QUEENS           34808
STATEN ISLAND     9250
Unspecified         25
Name: Borough, dtype: int64
In [25]:
#new dataframe showing boroughs
borough_df = pd.DataFrame({'Borough':['Brooklyn','Manhattan','Bronx', 'Queens', 'Staten Island', 'Unspecified'], 'Rat Sightings': [83461, 60699, 41917, 34507,9213,33]})
borough_df.head(5)
Out[25]:
Borough Rat Sightings
0 Brooklyn 83461
1 Manhattan 60699
2 Bronx 41917
3 Queens 34507
4 Staten Island 9213

Now the figure below gives us a better visual of just how many rats have actually been reported throughout the last few decade. Each color represents a different borough and each dot represents a different report. But where exactly have the most rats been reported in each borough? A better quesiton is, where should you avoid hanging out in NYC?

In [ ]:
import plotly.express as px

#don't include rows with no latitue or longitude
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])

#don't include unspecified
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']

#plotting the map with different colors for each borough
fig = px.scatter_mapbox(filtered_rat_df, lat='Latitude', lon='Longitude', color='Borough', mapbox_style='open-street-map')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

(Image of figure in case it can't be seen when file is converted to HTML)

newplot.png

Now, what are the spots with the most frequent rat sightings in each borough? The figure below uses the latitude and longitude of where the rat was sighted to determine just that.

In [ ]:
#remove rows with missing latitude or longitude values
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])

#do not include reports that say unspecfied borough
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']

#group data by Borough and then find latitude and longitude with the most sightings
most_sightings = filtered_rat_df.groupby('Borough')[['Latitude', 'Longitude']].max().reset_index()

#make map showing which latitute and logitude had the most rat sightings for each borough
most_sightings['MarkerSize'] = 10
fig = px.scatter_mapbox(most_sightings,lat='Latitude',lon='Longitude', hover_name='Borough',zoom=10, color='Borough', mapbox_style='open-street-map', size='MarkerSize')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

(Image of figure incase it cannot be seen when turned into HTML)

newplot.png

Over time, rat sightings have increased, as seen in the plot below. This could be due to multiple factors including the amount of people and those who are inclined that call 311 to report a rat. There is a notable dip in 2020 and 2021 probably due to COVID-19 and the fact that people were not goign outside as much. Now, in 2023, we have seen the most rat sightings reported up to date!

In [26]:
#display rat frequency per year
rat_df.Year.plot.hist()
Out[26]:
<Axes: ylabel='Frequency'>

But what good is it knowing the amount of rats if its not comparable to the size or population of each borough? To do that, we need to bring in more information about the population of each borough to calculate the amount of rats per 1000 people.

In [27]:
from google.colab import drive

population_df = pd.read_csv('/content/drive/My Drive/new_pop_again.csv')
population_df = population_df.drop([0]).reset_index(drop=True)
population_df['Population'] = population_df['Population'].str.replace(',', '').astype(int)

population_df.head(10)
Out[27]:
Index GeoType Borough GeoID Name Population
0 2 Boro Manhattan 1 Manhattan 1694251
1 3 Boro Bronx 2 Bronx 1472654
2 4 Boro Brooklyn 3 Brooklyn 2736074
3 5 Boro Queens 4 Queens 2405464
4 6 Boro Staten Island 5 Staten Island 495747
In [28]:
#combine borough and population then calculate per capita
result_df = pd.merge(borough_df, population_df, on='Borough')

#unspecified doesn't matter, it is so minimal
#delete unnecessary columns
result_df= result_df.drop(columns=['Index', 'GeoID', 'GeoType', 'Name'])

#rename rat sightings
result_df = result_df.rename(columns={'Rat Sightings': 'Rat_Sightings'})
result_df.head(7)
Out[28]:
Borough Rat_Sightings Population
0 Brooklyn 83461 2736074
1 Manhattan 60699 1694251
2 Bronx 41917 1472654
3 Queens 34507 2405464
4 Staten Island 9213 495747
In [29]:
#per capita rat sightings

result_df['per_capita_1000'] = (result_df['Rat_Sightings']/result_df['Population'])*1000
result_df.head(8)
Out[29]:
Borough Rat_Sightings Population per_capita_1000
0 Brooklyn 83461 2736074 30.503926
1 Manhattan 60699 1694251 35.826451
2 Bronx 41917 1472654 28.463577
3 Queens 34507 2405464 14.345257
4 Staten Island 9213 495747 18.584076

Using this newfound information, we can plot to analyze.

In [30]:
#result_df = result_df.set_index('Borough')
#plot
result_df.Rat_Sightings.plot.bar()

#label
plt.xlabel('Borough')
plt.ylabel('Rat Sightings')
plt.title('Rat Sightings by Borough')
plt.show()

#other graph for per capita rat sightings
result_df.per_capita_1000.plot.bar()
plt.xlabel('Borough')
plt.ylabel('Per 1000 people Rat Sightings')
plt.title('Per 1000 capita Rat Sightings by Borough')
plt.show()

So, although Brooklyn has the largest population of humans, when calculating the per capita rat sightings, Manhattan has the most rat sightings!

So, what's next?¶

The next point of interest we will be tackling is how rat sightings have changed before and after the pandemic. To analyze that, we will be bringing in COVID-19 data that displays the case count by date and by borough in New York City.

In [31]:
#from google.colab import drive

#drive.mount('/content/drive')
COVID_df = pd.read_csv('/content/drive/My Drive/COVID_Daily_Counts.csv')

columns = ['date_of_interest', 'CASE_COUNT', 'BX_CASE_COUNT', 'BK_CASE_COUNT', 'MN_CASE_COUNT', 'QN_CASE_COUNT', 'SI_CASE_COUNT']
new_COVID_df = COVID_df[columns]

new_COVID_df[new_COVID_df.columns.get_loc('date_of_interest')] = pd.to_datetime(new_COVID_df['date_of_interest'], format='%m/%d/%Y')

new_COVID_df.head()
<ipython-input-31-18fa6543d91f>:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[31]:
date_of_interest CASE_COUNT BX_CASE_COUNT BK_CASE_COUNT MN_CASE_COUNT QN_CASE_COUNT SI_CASE_COUNT 0
0 02/29/2020 1 0 0 1 0 0 2020-02-29
1 03/01/2020 0 0 0 0 0 0 2020-03-01
2 03/02/2020 0 0 0 0 0 0 2020-03-02
3 03/03/2020 1 0 0 0 1 0 2020-03-03
4 03/04/2020 5 0 1 2 2 0 2020-03-04
In [32]:
#merge covid cases and rat df
new_COVID_df['date_of_interest'] = pd.to_datetime(new_COVID_df['date_of_interest'], format='%m/%d/%Y')

rat_COVID_df = pd.merge(rat_per_day_df, new_COVID_df, left_on='formatted_date', right_on='date_of_interest', how='left')

# Drop the duplicate 'formatted_date' column from the COVID DataFrame
rat_COVID_df = rat_COVID_df.drop('date_of_interest', axis=1)

# Display the merged DataFrame
rat_COVID_df.head()
<ipython-input-32-68eaf60e929d>:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[32]:
formatted_date rat_sightings_count CASE_COUNT BX_CASE_COUNT BK_CASE_COUNT MN_CASE_COUNT QN_CASE_COUNT SI_CASE_COUNT 0
0 2010-01-01 9 NaN NaN NaN NaN NaN NaN NaT
1 2010-01-02 12 NaN NaN NaN NaN NaN NaN NaT
2 2010-01-03 3 NaN NaN NaN NaN NaN NaN NaT
3 2010-01-04 24 NaN NaN NaN NaN NaN NaN NaT
4 2010-01-05 14 NaN NaN NaN NaN NaN NaN NaT

The following plot shows rat sightings as compared to COVID Case count per day from September 1st 2019 to August 30th 2023. This shows us how what the rat sightings looked like pre covid, and after covid. This plot shows us that rat sightings have increased after COVID because in 2019, the sightings were lower on average. As expected, the when COVID cases increased, rat sightings decreased which can probably be attributed to people not going outside when there were more cases.

In [33]:
rat_COVID_df.index = pd.to_datetime(rat_COVID_df.index)
date_column = 'formatted_date'

#filtering data by date
start_date = '2019-09-01'
end_date = '2023-08-30'
rat_COVID_zoomed = rat_COVID_df[(rat_COVID_df[date_column] >= start_date) & (rat_COVID_df[date_column] <= end_date)]


plt.figure(figsize=(18, 12))

#COVID Cases
plt.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['CASE_COUNT'], label='COVID Cases', color='blue')
plt.xlabel('Date')

#SECOND y axis for rat sightings so rat sightings doesnt look like nothing
ax2 = plt.gca().twinx()
ax2.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['rat_sightings_count'], label='Rat Sightings per Day', color='red')
ax2.set_ylabel('Rat Sightings per Day')

# Adding title and naming axis
plt.title('COVID Cases and Rat Sightings per day From 2019-2023')
plt.ylabel('COVID Cases')
plt.legend()
plt.grid(True)
plt.show()

The following plot shows the amount of COVID-19 cases throughout time and per borough. There was a high spike in cases around January 2022. We will use the borough COVID data and compare it to the borough rat sightings we analyzed above in the furutre.

In [34]:
import matplotlib.pyplot as plt
import pandas as pd

date_column = 'date_of_interest'
new_COVID_df[date_column] = pd.to_datetime(COVID_df[date_column])

#filtering data by date
start_date = '2019-09-01'
end_date = '2023-08-30'
filtered_df = new_COVID_df[(new_COVID_df[date_column] >= start_date) & (new_COVID_df[date_column] <= end_date)]

boroughs = ['BX_CASE_COUNT', 'BK_CASE_COUNT', 'MN_CASE_COUNT', 'QN_CASE_COUNT', 'SI_CASE_COUNT']

#resizing to see data better
plt.figure(figsize=(15, 12))

#plotting
for borough in boroughs:
    plt.plot(filtered_df[date_column], filtered_df[borough], label=borough)

#adding title and naming axis
plt.title('COVID Cases vs. Time by Borough')
plt.xlabel('Date')
plt.ylabel('Case Count')
plt.legend()
plt.grid(True)
plt.show()
<ipython-input-34-c2cc92f122e8>:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In future analysis, we may also attempt to use other features of the rat data to analyze where you can expect to see the most rats in the future. We will use KNN analysis to do so in our future milestone. Another question that we want to answer is

In [35]:
import seaborn as sns

correlation_matrix = rat_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
<ipython-input-35-1eef80d69bb4>:3: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.